de en es ja ko ru zh

basic-memory

by basicmachines-co

Python

AGPL 3.0

1,969

Overview InspectNew Endpoints Schema Related Servers Reviews Score

Need Help?View Source Code Report Issue

basic-memory
specs

SPEC-19 Sync Performance and Memory Optimization.md•37.5 kB

--- title: 'SPEC-19: Sync Performance and Memory Optimization' type: spec permalink: specs/spec-17-sync-performance-optimization tags: - performance - memory - sync - optimization - core status: draft --- # SPEC-19: Sync Performance and Memory Optimization ## Why ### Problem Statement Current sync implementation causes Out-of-Memory (OOM) kills and poor performance on production systems: **Evidence from Production**: - **Tenant-6d2ff1a3**: OOM killed on 1GB machine - Files: 2,621 total (31 PDFs, 80MB binary data) - Memory: 1.5-1.7GB peak usage - Sync duration: 15+ minutes - Error: `Out of memory: Killed process 693 (python)` **Root Causes**: 1. **Checksum-based scanning loads ALL files into memory** - `scan_directory()` computes checksums for ALL 2,624 files upfront - Results stored in multiple dicts (`ScanResult.files`, `SyncReport.checksums`) - Even unchanged files are fully read and checksummed 2. **Large files read entirely for checksums** - 16MB PDF → Full read into memory → Compute checksum - No streaming or chunked processing - TigrisFS caching compounds memory usage 3. **Unbounded concurrency** - All 2,624 files processed simultaneously - Each file loads full content into memory - No semaphore limiting concurrent operations 4. **Cloud-specific resource leaks** - aiohttp session leak in keepalive (not in context manager) - Circuit breaker resets every 30s sync cycle (ineffective) - Thundering herd: all tenants sync at :00 and :30 ### Impact - **Production stability**: OOM kills are unacceptable - **User experience**: 15+ minute syncs are too slow - **Cost**: Forced upgrades from 1GB → 2GB machines ($5-10/mo per tenant) - **Scalability**: Current approach won't scale to 100+ tenants ### Architectural Decision **Fix in basic-memory core first, NOT UberSync** Rationale: - Root causes are algorithmic, not architectural - Benefits all users (CLI + Cloud) - Lower risk than new centralized service - Known solutions (rsync/rclone use same pattern) - Can defer UberSync until metrics prove it necessary ## What ### Affected Components **basic-memory (core)**: - `src/basic_memory/sync/sync_service.py` - Core sync algorithm (~42KB) - `src/basic_memory/models.py` - Entity model (add mtime/size columns) - `src/basic_memory/file_utils.py` - Checksum computation functions - `src/basic_memory/repository/entity_repository.py` - Database queries - `alembic/versions/` - Database migration for schema changes **basic-memory-cloud (wrapper)**: - `apps/api/src/basic_memory_cloud_api/sync_worker.py` - Cloud sync wrapper - Circuit breaker implementation - Sync coordination logic ### Database Schema Changes Add to Entity model: ```python mtime: float # File modification timestamp size: int # File size in bytes ``` ## How (High Level) ### Phase 1: Core Algorithm Fixes (basic-memory) **Priority: P0 - Critical** #### 1.1 mtime-based Scanning (Issue #383) Replace expensive checksum-based scanning with lightweight stat-based comparison: ```python async def scan_directory(self, directory: Path) -> ScanResult: """Scan using mtime/size instead of checksums""" result = ScanResult() for root, dirnames, filenames in os.walk(str(directory)): for filename in filenames: rel_path = path.relative_to(directory).as_posix() stat = path.stat() # Store lightweight metadata instead of checksum result.files[rel_path] = { 'mtime': stat.st_mtime, 'size': stat.st_size } return result async def scan(self, directory: Path): """Compare mtime/size, only compute checksums for changed files""" db_state = await self.get_db_file_state() # Include mtime/size scan_result = await self.scan_directory(directory) for file_path, metadata in scan_result.files.items(): db_metadata = db_state.get(file_path) # Only compute expensive checksum if mtime/size changed if not db_metadata or metadata['mtime'] != db_metadata['mtime']: checksum = await self._compute_checksum_streaming(file_path) # Process immediately, don't accumulate in memory ``` **Benefits**: - No file reads during initial scan (just stat calls) - ~90% reduction in memory usage - ~10x faster scan phase - Only checksum files that actually changed #### 1.2 Streaming Checksum Computation (Issue #382) For large files (>1MB), use chunked reading to avoid loading entire file: ```python async def _compute_checksum_streaming(self, path: Path, chunk_size: int = 65536) -> str: """Compute checksum using 64KB chunks for large files""" hasher = hashlib.sha256() loop = asyncio.get_event_loop() def read_chunks(): with open(path, 'rb') as f: while chunk := f.read(chunk_size): hasher.update(chunk) await loop.run_in_executor(None, read_chunks) return hasher.hexdigest() async def _compute_checksum_async(self, file_path: Path) -> str: """Choose appropriate checksum method based on file size""" stat = file_path.stat() if stat.st_size > 1_048_576: # 1MB threshold return await self._compute_checksum_streaming(file_path) else: # Small files: existing fast path content = await self._read_file_async(file_path) return compute_checksum(content) ``` **Benefits**: - Constant memory usage regardless of file size - 16MB PDF uses 64KB memory (not 16MB) - Works well with TigrisFS network I/O #### 1.3 Bounded Concurrency (Issue #198) Add semaphore to limit concurrent file operations, or consider using aiofiles and async reads ```python class SyncService: def __init__(self, ...): # ... existing code ... self._file_semaphore = asyncio.Semaphore(10) # Max 10 concurrent self._max_tracked_failures = 100 # LRU cache limit async def _read_file_async(self, file_path: Path) -> str: async with self._file_semaphore: loop = asyncio.get_event_loop() return await loop.run_in_executor( self._thread_pool, file_path.read_text, "utf-8" ) async def _record_failure(self, path: str, error: str): # ... existing code ... # Implement LRU eviction if len(self._file_failures) > self._max_tracked_failures: self._file_failures.popitem(last=False) # Remove oldest ``` **Benefits**: - Maximum 10 files in memory at once (vs all 2,624) - 90%+ reduction in peak memory usage - Prevents unbounded memory growth on error-prone projects ### Phase 2: Cloud-Specific Fixes (basic-memory-cloud) **Priority: P1 - High** #### 2.1 Fix Resource Leaks ```python # apps/api/src/basic_memory_cloud_api/sync_worker.py async def send_keepalive(): """Send keepalive pings using proper session management""" # Use context manager to ensure cleanup async with aiohttp.ClientSession( timeout=aiohttp.ClientTimeout(total=5) ) as session: while True: try: await session.get(f"https://{fly_app_name}.fly.dev/health") await asyncio.sleep(10) except asyncio.CancelledError: raise # Exit cleanly except Exception as e: logger.warning(f"Keepalive failed: {e}") ``` #### 2.2 Improve Circuit Breaker Track failures across sync cycles instead of resetting every 30s: ```python # Persistent failure tracking class SyncWorker: def __init__(self): self._persistent_failures: Dict[str, int] = {} # file -> failure_count self._failure_window_start = time.time() async def should_skip_file(self, file_path: str) -> bool: # Skip files that failed >3 times in last hour if self._persistent_failures.get(file_path, 0) > 3: if time.time() - self._failure_window_start < 3600: return True return False ``` ### Phase 3: Measurement & Decision **Priority: P2 - Future** After implementing Phases 1-2, collect metrics for 2 weeks: - Memory usage per tenant sync - Sync duration (scan + process) - Concurrent sync load at peak times - OOM incidents - Resource costs **UberSync Decision Criteria**: Build centralized sync service ONLY if metrics show: - ✅ Core fixes insufficient for >100 tenants - ✅ Resource contention causing problems - ✅ Need for tenant tier prioritization (paid > free) - ✅ Cost savings justify complexity Otherwise, defer UberSync as premature optimization. ## How to Evaluate ### Success Metrics (Phase 1) **Memory Usage**: - ✅ Peak memory <500MB for 2,000+ file projects (was 1.5-1.7GB) - ✅ Memory usage linear with concurrent files (10 max), not total files - ✅ Large file memory usage: 64KB chunks (not 16MB) **Performance**: - ✅ Initial scan <30 seconds (was 5+ minutes) - ✅ Full sync <5 minutes for 2,000+ files (was 15+ minutes) - ✅ Subsequent syncs <10 seconds (only changed files) **Stability**: - ✅ 2,000+ file projects run on 1GB machines - ✅ Zero OOM kills in production - ✅ No degradation with binary files (PDFs, images) ### Success Metrics (Phase 2) **Resource Management**: - ✅ Zero aiohttp session leaks (verified via monitoring) - ✅ Circuit breaker prevents repeated failures (>3 fails = skip for 1 hour) - ✅ Tenant syncs distributed over 30s window (no thundering herd) **Observability**: - ✅ Logfire traces show memory usage per sync - ✅ Clear logging of skipped files and reasons - ✅ Metrics on sync duration, file counts, failure rates ### Test Plan **Unit Tests** (basic-memory): - mtime comparison logic - Streaming checksum correctness - Semaphore limiting (mock 100 files, verify max 10 concurrent) - LRU cache eviction - Checksum computation: streaming vs non-streaming equivalence **Integration Tests** (basic-memory): - Large file handling (create 20MB test file) - Mixed file types (text + binary) - Changed file detection via mtime - Sync with 1,000+ files **Load Tests** (basic-memory-cloud): - Test on tenant-6d2ff1a3 (2,621 files, 31 PDFs) - Monitor memory during full sync with Logfire - Measure scan and sync duration - Run on 1GB machine (downgrade from 2GB to verify) - Simulate 10 concurrent tenant syncs **Regression Tests**: - Verify existing sync scenarios still work - CLI sync behavior unchanged - File watcher integration unaffected ### Performance Benchmarks Establish baseline, then compare after each phase: | Metric | Baseline | Phase 1 Target | Phase 2 Target | |--------|----------|----------------|----------------| | Peak Memory (2,600 files) | 1.5-1.7GB | <500MB | <450MB | | Initial Scan Time | 5+ min | <30 sec | <30 sec | | Full Sync Time | 15+ min | <5 min | <5 min | | Subsequent Sync | 2+ min | <10 sec | <10 sec | | OOM Incidents/Week | 2-3 | 0 | 0 | | Min RAM Required | 2GB | 1GB | 1GB | ## Implementation Phases ### Phase 0.5: Database Schema & Streaming Foundation **Priority: P0 - Required for Phase 1** This phase establishes the foundation for streaming sync with mtime-based change detection. **Database Schema Changes**: - [x] Add `mtime` column to Entity model (REAL type for float timestamp) - [x] Add `size` column to Entity model (INTEGER type for file size in bytes) - [x] Create Alembic migration for new columns (nullable initially) - [x] Add indexes on `(file_path, project_id)` for optimistic upsert performance - [ ] Backfill existing entities with mtime/size from filesystem **Streaming Architecture**: - [x] Replace `os.walk()` with `os.scandir()` for cached stat info - [ ] Eliminate `get_db_file_state()` - no upfront SELECT all entities - [x] Implement streaming iterator `_scan_directory_streaming()` - [x] Add `get_by_file_path()` optimized query (single file lookup) - [x] Add `get_all_file_paths()` for deletion detection (paths only, no entities) **Benefits**: - **50% fewer network calls** on Tigris (scandir returns cached stat) - **No large dicts in memory** (process files one at a time) - **Indexed lookups** instead of full table scan - **Foundation for mtime comparison** (Phase 1) **Code Changes**: ```python # Before: Load all entities upfront db_paths = await self.get_db_file_state() # SELECT * FROM entity WHERE project_id = ? scan_result = await self.scan_directory() # os.walk() + stat() per file # After: Stream and query incrementally async for file_path, stat_info in self.scan_directory(): # scandir() with cached stat db_entity = await self.entity_repository.get_by_file_path(rel_path) # Indexed lookup # Process immediately, no accumulation ``` **Files Modified**: - `src/basic_memory/models.py` - Add mtime/size columns - `alembic/versions/xxx_add_mtime_size.py` - Migration - `src/basic_memory/sync/sync_service.py` - Streaming implementation - `src/basic_memory/repository/entity_repository.py` - Add get_all_file_paths() **Migration Strategy**: ```sql -- Migration: Add nullable columns ALTER TABLE entity ADD COLUMN mtime REAL; ALTER TABLE entity ADD COLUMN size INTEGER; -- Backfill from filesystem during first sync after upgrade -- (Handled in sync_service on first scan) ``` ### Phase 1: Core Fixes **mtime-based scanning**: - [x] Add mtime/size columns to Entity model (completed in Phase 0.5) - [x] Database migration (alembic) (completed in Phase 0.5) - [x] Refactor `scan()` to use streaming architecture with mtime/size comparison - [x] Update `sync_markdown_file()` and `sync_regular_file()` to store mtime/size in database - [x] Only compute checksums for changed files (mtime/size differ) - [x] Unit tests for streaming scan (6 tests passing) - [ ] Integration test with 1,000 files (defer to benchmarks) **Streaming checksums**: - [x] Implement `_compute_checksum_streaming()` with chunked reading - [x] Add file size threshold logic (1MB) - [x] Test with large files (16MB PDF) - [x] Verify memory usage stays constant - [x] Test checksum equivalence (streaming vs non-streaming) **Bounded concurrency**: - [x] Add semaphore (10 concurrent) to `_read_file_async()` (already existed) - [x] Add LRU cache for failures (100 max) (already existed) - [ ] Review thread pool size configuration - [ ] Load test with 2,000+ files - [ ] Verify <500MB peak memory **Cleanup & Optimization**: - [x] Eliminate `get_db_file_state()` - no upfront SELECT all entities (streaming architecture complete) - [x] Consolidate file operations in FileService (eliminate duplicate checksum logic) - [x] Add aiofiles dependency (already present) - [x] FileService streaming checksums for files >1MB - [x] SyncService delegates all file operations to FileService - [x] Complete true async I/O refactoring - all file operations use aiofiles - [x] Added `FileService.read_file_content()` using aiofiles - [x] Removed `SyncService._read_file_async()` wrapper method - [x] Removed `SyncService._compute_checksum_async()` wrapper method - [x] Inlined all 7 checksum calls to use `file_service.compute_checksum()` directly - [x] All file I/O operations now properly consolidated in FileService with non-blocking I/O - [x] Removed sync_status_service completely (unnecessary complexity and state tracking) - [x] Removed `sync_status_service.py` and `sync_status` MCP tool - [x] Removed all `sync_status_tracker` calls from `sync_service.py` - [x] Removed migration status checks from MCP tools (`write_note`, `read_note`, `build_context`) - [x] Removed `check_migration_status()` and `wait_for_migration_or_return_status()` from `utils.py` - [x] Removed all related tests (4 test files deleted) - [x] All 1184 tests passing **Phase 1 Implementation Summary:** Phase 1 is now complete with all core fixes implemented and tested: 1. **Streaming Architecture** (Phase 0.5 + Phase 1): - Replaced `os.walk()` with `os.scandir()` for cached stat info - Eliminated upfront `get_db_file_state()` SELECT query - Implemented `_scan_directory_streaming()` for incremental processing - Added indexed `get_by_file_path()` lookups - Result: 50% fewer network calls on TigrisFS, no large dicts in memory 2. **mtime-based Change Detection**: - Added `mtime` and `size` columns to Entity model - Alembic migration completed and deployed - Only compute checksums when mtime/size differs from database - Result: ~90% reduction in checksum operations during typical syncs 3. **True Async I/O with aiofiles**: - All file operations consolidated in FileService - `FileService.compute_checksum()`: 64KB chunked reading for constant memory (lines 261-296 of file_service.py) - `FileService.read_file_content()`: Non-blocking file reads with aiofiles (lines 160-193 of file_service.py) - Removed all wrapper methods from SyncService (`_read_file_async`, `_compute_checksum_async`) - Semaphore controls concurrency (max 10 concurrent file operations) - Result: Constant memory usage regardless of file size, true non-blocking I/O 4. **Test Coverage**: - 41/43 sync tests passing (2 skipped as expected) - Circuit breaker tests updated for new architecture - Streaming checksum equivalence verified - All edge cases covered (large files, concurrent operations, failures) **Key Files Modified**: - `src/basic_memory/models.py` - Added mtime/size columns - `alembic/versions/xxx_add_mtime_size.py` - Database migration - `src/basic_memory/sync/sync_service.py` - Streaming implementation, removed wrapper methods - `src/basic_memory/services/file_service.py` - Added `read_file_content()`, streaming checksums - `src/basic_memory/repository/entity_repository.py` - Added `get_all_file_paths()` - `tests/sync/test_sync_service.py` - Updated circuit breaker test mocks **Performance Improvements Achieved**: - Memory usage: Constant per file (64KB chunks) vs full file in memory - Scan speed: Stat-only scan (no checksums for unchanged files) - I/O efficiency: True async with aiofiles (no thread pool blocking) - Network efficiency: 50% fewer calls on TigrisFS via scandir caching - Architecture: Clean separation of concerns (FileService owns all file I/O) - Reduced complexity: Removed unnecessary sync_status_service state tracking **Observability**: - [x] Added Logfire instrumentation to `sync_file()` and `sync_markdown_file()` - [x] Logfire disabled by default via `ignore_no_config = true` in pyproject.toml - [x] No telemetry in FOSS version unless explicitly configured - [x] Cloud deployment can enable Logfire for performance monitoring **Next Steps**: Phase 1.5 scan watermark optimization for large project performance. ### Phase 1.5: Scan Watermark Optimization **Priority: P0 - Critical for Large Projects** This phase addresses Issue #388 where large projects (1,460+ files) take 7+ minutes for sync operations even when no files have changed. **Problem Analysis**: From production data (tenant-0a20eb58): - Total sync time: 420-450 seconds (7+ minutes) with 0 changes - Scan phase: 321 seconds (75% of total time) - Per-file cost: 220ms × 1,460 files = 5+ minutes - Root cause: Network I/O to TigrisFS for stat operations (even with mtime columns) - 15 concurrent syncs every 30 seconds compounds the problem **Current Behavior** (Phase 1): ```python async def scan(self, directory: Path): """Scan filesystem using mtime/size comparison""" # Still stats ALL 1,460 files every sync cycle async for file_path, stat_info in self._scan_directory_streaming(): db_entity = await self.entity_repository.get_by_file_path(file_path) # Compare mtime/size, skip unchanged files # Only checksum if changed (✅ already optimized) ``` **Problem**: Even with mtime optimization, we stat every file on every scan. On TigrisFS (network FUSE mount), this means 1,460 network calls taking 5+ minutes. **Solution: Scan Watermark + File Count Detection** Track when we last scanned and how many files existed. Use filesystem-level filtering to only examine files modified since last scan. **Key Insight**: File count changes signal deletions - Count same → incremental scan (95% of syncs) - Count increased → new files found by incremental (4% of syncs) - Count decreased → files deleted, need full scan (1% of syncs) **Database Schema Changes**: Add to Project model: ```python last_scan_timestamp: float | None # Unix timestamp of last successful scan start last_file_count: int | None # Number of files found in last scan ``` **Implementation Strategy**: ```python async def scan(self, directory: Path): """Smart scan using watermark and file count""" project = await self.project_repository.get_current() # Step 1: Quick file count (fast on TigrisFS: 1.4s for 1,460 files) current_count = await self._quick_count_files(directory) # Step 2: Determine scan strategy if project.last_file_count is None: # First sync ever → full scan file_paths = await self._scan_directory_full(directory) scan_type = "full_initial" elif current_count < project.last_file_count: # Files deleted → need full scan to detect which ones file_paths = await self._scan_directory_full(directory) scan_type = "full_deletions" logger.info(f"File count decreased ({project.last_file_count} → {current_count}), running full scan") elif project.last_scan_timestamp is not None: # Incremental scan: only files modified since last scan file_paths = await self._scan_directory_modified_since( directory, project.last_scan_timestamp ) scan_type = "incremental" logger.info(f"Incremental scan since {project.last_scan_timestamp}, found {len(file_paths)} changed files") else: # Fallback to full scan file_paths = await self._scan_directory_full(directory) scan_type = "full_fallback" # Step 3: Process changed files (existing logic) for file_path in file_paths: await self._process_file(file_path) # Step 4: Update watermark AFTER successful scan await self.project_repository.update( project.id, last_scan_timestamp=time.time(), # Start of THIS scan last_file_count=current_count ) # Step 5: Record metrics logfire.metric_counter(f"sync.scan.{scan_type}").add(1) logfire.metric_histogram("sync.scan.files_scanned", unit="files").record(len(file_paths)) ``` **Helper Methods**: ```python async def _quick_count_files(self, directory: Path) -> int: """Fast file count using find command""" # TigrisFS: 1.4s for 1,460 files result = await asyncio.create_subprocess_shell( f'find "{directory}" -type f | wc -l', stdout=asyncio.subprocess.PIPE ) stdout, _ = await result.communicate() return int(stdout.strip()) async def _scan_directory_modified_since( self, directory: Path, since_timestamp: float ) -> List[str]: """Use find -newermt for filesystem-level filtering""" # Convert timestamp to find-compatible format since_date = datetime.fromtimestamp(since_timestamp).strftime("%Y-%m-%d %H:%M:%S") # TigrisFS: 0.2s for 0 changed files (vs 5+ minutes for full scan) result = await asyncio.create_subprocess_shell( f'find "{directory}" -type f -newermt "{since_date}"', stdout=asyncio.subprocess.PIPE ) stdout, _ = await result.communicate() # Convert absolute paths to relative file_paths = [] for line in stdout.decode().splitlines(): if line: rel_path = Path(line).relative_to(directory).as_posix() file_paths.append(rel_path) return file_paths ``` **TigrisFS Testing Results** (SSH to production-basic-memory-tenant-0a20eb58): ```bash # Full file count $ time find . -type f | wc -l 1460 real 0m1.362s # ✅ Acceptable # Incremental scan (1 hour window) $ time find . -type f -newermt "2025-01-20 10:00:00" | wc -l 0 real 0m0.161s # ✅ 8.5x faster! # Incremental scan (24 hours) $ time find . -type f -newermt "2025-01-19 11:00:00" | wc -l 0 real 0m0.239s # ✅ 5.7x faster! ``` **Conclusion**: `find -newermt` works perfectly on TigrisFS and provides massive speedup. **Expected Performance Improvements**: | Scenario | Files Changed | Current Time | With Watermark | Speedup | |----------|---------------|--------------|----------------|---------| | No changes (common) | 0 | 420s | ~2s | 210x | | Few changes | 5-10 | 420s | ~5s | 84x | | Many changes | 100+ | 420s | ~30s | 14x | | Deletions (rare) | N/A | 420s | 420s | 1x | **Full sync breakdown** (1,460 files, 0 changes): - File count: 1.4s - Incremental scan: 0.2s - Database updates: 0.4s - **Total: ~2s (225x faster)** **Metrics to Track**: ```python # Scan type distribution logfire.metric_counter("sync.scan.full_initial").add(1) logfire.metric_counter("sync.scan.full_deletions").add(1) logfire.metric_counter("sync.scan.incremental").add(1) # Performance metrics logfire.metric_histogram("sync.scan.duration", unit="ms").record(scan_ms) logfire.metric_histogram("sync.scan.files_scanned", unit="files").record(file_count) logfire.metric_histogram("sync.scan.files_changed", unit="files").record(changed_count) # Watermark effectiveness logfire.metric_histogram("sync.scan.watermark_age", unit="s").record( time.time() - project.last_scan_timestamp ) ``` **Edge Cases Handled**: 1. **First sync**: No watermark → full scan (expected) 2. **Deletions**: File count decreased → full scan (rare but correct) 3. **Clock skew**: Use scan start time, not end time (captures files created during scan) 4. **Scan failure**: Don't update watermark on failure (retry will re-scan) 5. **New files**: Count increased → incremental scan finds them (common, fast) **Files to Modify**: - `src/basic_memory/models.py` - Add last_scan_timestamp, last_file_count to Project - `alembic/versions/xxx_add_scan_watermark.py` - Migration for new columns - `src/basic_memory/sync/sync_service.py` - Implement watermark logic - `src/basic_memory/repository/project_repository.py` - Update methods - `tests/sync/test_sync_watermark.py` - Test watermark behavior **Test Plan**: - [x] SSH test on TigrisFS confirms `find -newermt` works (completed) - [x] Unit tests for scan strategy selection (4 tests) - [x] Unit tests for file count detection (integrated in strategy tests) - [x] Integration test: verify incremental scan finds changed files (4 tests) - [x] Integration test: verify deletion detection triggers full scan (2 tests) - [ ] Load test on tenant-0a20eb58 (1,460 files) - pending production deployment - [ ] Verify <3s for no-change sync - pending production deployment **Implementation Status**: ✅ **COMPLETED** **Code Changes** (Commit: `fb16055d`): - ✅ Added `last_scan_timestamp` and `last_file_count` to Project model - ✅ Created database migration `e7e1f4367280_add_scan_watermark_tracking_to_project.py` - ✅ Implemented smart scan strategy selection in `sync_service.py` - ✅ Added `_quick_count_files()` using `find | wc -l` (~1.4s for 1,460 files) - ✅ Added `_scan_directory_modified_since()` using `find -newermt` (~0.2s) - ✅ Added `_scan_directory_full()` wrapper for full scans - ✅ Watermark update logic after successful sync (uses sync START time) - ✅ Logfire metrics for scan types and performance tracking **Test Coverage** (18 tests in `test_sync_service_incremental.py`): - ✅ Scan strategy selection (4 tests) - First sync uses full scan - File count decreased triggers full scan - Same file count uses incremental scan - Increased file count uses incremental scan - ✅ Incremental scan base cases (4 tests) - No changes scenario - Detects new files - Detects modified files - Detects multiple changes - ✅ Deletion detection (2 tests) - Single file deletion - Multiple file deletions - ✅ Move detection (2 tests) - Moves require full scan (renames don't update mtime) - Moves detected in full scan via checksum - ✅ Watermark update (3 tests) - Watermark updated after successful sync - Watermark uses sync start time - File count accuracy - ✅ Edge cases (3 tests) - Concurrent file changes - Empty directory handling - Respects .gitignore patterns **Performance Expectations** (to be verified in production): - No changes: 420s → ~2s (210x faster) - Few changes (5-10): 420s → ~5s (84x faster) - Many changes (100+): 420s → ~30s (14x faster) - Deletions: 420s → 420s (full scan, rare case) **Rollout Strategy**: 1. ✅ Code complete and tested (18 new tests, all passing) 2. ✅ Pushed to `phase-0.5-streaming-foundation` branch 3. ⏳ Windows CI tests running 4. 📊 Deploy to staging tenant with watermark optimization 5. 📊 Monitor scan performance metrics via Logfire 6. 📊 Verify no missed files (compare full vs incremental results) 7. 📊 Deploy to production tenant-0a20eb58 8. 📊 Measure actual improvement (expect 420s → 2-3s) **Success Criteria**: - ✅ Implementation complete with comprehensive tests - [ ] No-change syncs complete in <3 seconds (was 420s) - pending production test - [ ] Incremental scans (95% of cases) use watermark - pending production test - [ ] Deletion detection works correctly (full scan when needed) - tested in unit tests ✅ - [ ] No files missed due to watermark logic - tested in unit tests ✅ - [ ] Metrics show scan type distribution matches expectations - pending production test **Next Steps**: 1. Production deployment to tenant-0a20eb58 2. Measure actual performance improvements 3. Monitor metrics for 1 week 4. Phase 2 cloud-specific fixes 5. Phase 3 production measurement and UberSync decision ### Phase 2: Cloud Fixes **Resource leaks**: - [ ] Fix aiohttp session context manager - [ ] Implement persistent circuit breaker - [ ] Add memory monitoring/alerts - [ ] Test on production tenant **Sync coordination**: - [ ] Implement hash-based staggering - [ ] Add jitter to sync intervals - [ ] Load test with 10 concurrent tenants - [ ] Verify no thundering herd ### Phase 3: Measurement **Deploy to production**: - [ ] Deploy Phase 1+2 changes - [ ] Downgrade tenant-6d2ff1a3 to 1GB - [ ] Monitor for OOM incidents **Collect metrics**: - [ ] Memory usage patterns - [ ] Sync duration distributions - [ ] Concurrent sync load - [ ] Cost analysis **UberSync decision**: - [ ] Review metrics against decision criteria - [ ] Document findings - [ ] Create SPEC-18 for UberSync if needed ## Related Issues ### basic-memory (core) - [#383](https://github.com/basicmachines-co/basic-memory/issues/383) - Refactor sync to use mtime-based scanning - [#382](https://github.com/basicmachines-co/basic-memory/issues/382) - Optimize memory for large file syncs - [#371](https://github.com/basicmachines-co/basic-memory/issues/371) - aiofiles for non-blocking I/O (future) ### basic-memory-cloud - [#198](https://github.com/basicmachines-co/basic-memory-cloud/issues/198) - Memory optimization for sync worker - [#189](https://github.com/basicmachines-co/basic-memory-cloud/issues/189) - Circuit breaker for infinite retry loops ## References **Standard sync tools using mtime**: - rsync: Uses mtime-based comparison by default, only checksums on `--checksum` flag - rclone: Default is mtime/size, `--checksum` mode optional - syncthing: Block-level sync with mtime tracking **fsnotify polling** (future consideration): - [fsnotify/fsnotify#9](https://github.com/fsnotify/fsnotify/issues/9) - Polling mode for network filesystems ## Notes ### Why Not UberSync Now? **Premature Optimization**: - Current problems are algorithmic, not architectural - No evidence that multi-tenant coordination is the issue - Single tenant OOM proves algorithm is the problem **Benefits of Core-First Approach**: - ✅ Helps all users (CLI + Cloud) - ✅ Lower risk (no new service) - ✅ Clear path (issues specify fixes) - ✅ Can defer UberSync until proven necessary **When UberSync Makes Sense**: - >100 active tenants causing resource contention - Need for tenant tier prioritization (paid > free) - Centralized observability requirements - Cost optimization at scale ### Migration Strategy **Backward Compatibility**: - New mtime/size columns nullable initially - Existing entities sync normally (compute mtime on first scan) - No breaking changes to MCP API - CLI behavior unchanged **Rollout**: 1. Deploy to staging with test tenant 2. Validate memory/performance improvements 3. Deploy to production (blue-green) 4. Monitor for 1 week 5. Downgrade tenant machines if successful ## Further Considerations ### Version Control System (VCS) Integration **Context:** Users frequently request git versioning, and large projects with PDFs/images pose memory challenges. #### Git-Based Sync **Approach:** Use git for change detection instead of custom mtime comparison. ```python # Git automatically tracks changes repo = git.Repo(project_path) repo.git.add(A=True) diff = repo.index.diff('HEAD') for change in diff: if change.change_type == 'M': # Modified await sync_file(change.b_path) ``` **Pros:** - ✅ Proven, battle-tested change detection - ✅ Built-in rename/move detection (similarity index) - ✅ Efficient for cloud sync (git protocol over HTTP) - ✅ Could enable version history as bonus feature - ✅ Users want git integration anyway **Cons:** - ❌ User confusion (`.git` folder in knowledge base) - ❌ Conflicts with existing git repos (submodule complexity) - ❌ Adds dependency (git binary or dulwich/pygit2) - ❌ Less control over sync logic - ❌ Doesn't solve large file problem (PDFs still checksummed) - ❌ Git LFS adds complexity #### Jujutsu (jj) Alternative **Why jj is compelling:** 1. **Working Copy as Source of Truth** - Git: Staging area is intermediate state - Jujutsu: Working copy IS a commit - Aligns with "files are source of truth" philosophy! 2. **Automatic Change Tracking** - No manual staging required - Working copy changes tracked automatically - Better fit for sync operations vs git's commit-centric model 3. **Conflict Handling** - User edits + sync changes both preserved - Operation log vs linear history - Built for operations, not just history **Cons:** - ❌ New/immature (2020 vs git's 2005) - ❌ Not universally available - ❌ Steeper learning curve for users - ❌ No LFS equivalent yet - ❌ Still doesn't solve large file checksumming #### Git Index Format (Hybrid Approach) **Best of both worlds:** Use git's index format without full git repo. ```python from dulwich.index import Index # Pure Python # Use git index format for tracking idx = Index(project_path / '.basic-memory' / 'index') for file in files: stat = file.stat() if idx.get(file) and idx[file].mtime == stat.st_mtime: continue # Unchanged (git's proven logic) await sync_file(file) idx[file] = (stat.st_mtime, stat.st_size, sha) ``` **Pros:** - ✅ Git's proven change detection logic - ✅ No user-visible `.git` folder - ✅ No git dependency (pure Python) - ✅ Full control over sync **Cons:** - ❌ Adds dependency (dulwich) - ❌ Doesn't solve large files - ❌ No built-in versioning ### Large File Handling **Problem:** PDFs/images cause memory issues regardless of VCS choice. **Solutions (Phase 1+):** **1. Skip Checksums for Large Files** ```python if stat.st_size > 10_000_000: # 10MB threshold checksum = None # Use mtime/size only logger.info(f"Skipping checksum for {file_path}") ``` **2. Partial Hashing** ```python if file.suffix in ['.pdf', '.jpg', '.png']: # Hash first/last 64KB instead of entire file checksum = hash_partial(file, chunk_size=65536) ``` **3. External Blob Storage** ```python if stat.st_size > 10_000_000: blob_id = await upload_to_tigris_blob(file) entity.blob_id = blob_id entity.file_path = None # Not in main sync ``` ### Recommendation & Timeline **Phase 0.5-1 (Now):** Custom streaming + mtime - ✅ Solves urgent memory issues - ✅ No dependencies - ✅ Full control - ✅ Skip checksums for large files (>10MB) - ✅ Proven pattern (rsync/rclone) **Phase 2 (After metrics):** Git index format exploration ```python # Optional: Use git index for tracking if beneficial from dulwich.index import Index # No git repo, just index file format ``` **Future (User feature):** User-facing versioning ```python # Let users opt into VCS: basic-memory config set versioning git basic-memory config set versioning jj basic-memory config set versioning none # Current behavior # Integrate with their chosen workflow # Not forced upon them ``` **Rationale:** 1. **Don't block on VCS decision** - Memory issues are P0 2. **Learn from deployment** - See actual usage patterns 3. **Keep options open** - Can add git/jj later 4. **Files as source of truth** - Core philosophy preserved 5. **Large files need attention regardless** - VCS won't solve that **Decision Point:** - If Phase 0.5/1 achieves memory targets → VCS integration deferred - If users strongly request versioning → Add as opt-in feature - If change detection becomes bottleneck → Explore git index format ## Agent Assignment **Phase 1 Implementation**: `python-developer` agent - Expertise in FastAPI, async Python, database migrations - Handles basic-memory core changes **Phase 2 Implementation**: `python-developer` agent - Same agent continues with cloud-specific fixes - Maintains consistency across phases **Phase 3 Review**: `system-architect` agent - Analyzes metrics and makes UberSync decision - Creates SPEC-18 if centralized service needed

MCP directory API

We provide all the information about MCP servers via our MCP API.

curl -X GET 'https://glama.ai/api/mcp/v1/servers/basicmachines-co/basic-memory'

If you have feedback or need assistance with the MCP directory API, please join our Discord server